1. Introduction



Shannon's entropy maximization (MaxEnt), interpreted by its proponent
E.T. Jaynes as a 'method for inference from incomplete information'
(Jaynes, 1991), has a debatable relationship to the Bayesian method
(see (Jaynes, 1988), (Zellner, 1988), (Seidenfeld, 1987), (Uffink,
1995), (Golan, Judge and Miller, 1996)), and to the Maximum Likelihood
method (see (Jaynes, 1982), (Golan, 1998), (Mohammad-Djafari, 1998),
(Grendar and Grendar, 1999)). MaxEnt has also been presented as a
method for assigning prior probabilities (see (Jaynes, 1998), and
(Seidenfeld, 1987), (Uffink, 1995) for a critique), and as an extension
of the principle of insufficient reason. On the theoretical level, the
status of MaxEnt as a statistical method has been in dispute for
decades.

MaxEnt has been successfully applied to linear inverse problems 
without noise that arise in many branches of science. Its scope was 
substantially extended to the inverse problems with noise by (Golan, 
Judge and Miller, 1996), and further by (van Akkeren and Judge, 1999), 
(Mittelhammer, Judge and Miller, 2000), who have put MaxEnt into a 
context of extremum estimation and inference methods. 
Theoretical disputes on the status of MaxEnt among other statistical
methods generally ignore a fundamental question: 'What is the question
MaxEnt answers?'. Two other questions, 'What kind of constraints are
relevant to bind the entropy maximization?' and 'What if information on
the amount of data is also available?', have been left on the margin of
interest by both Jaynes and the opponents of MaxEnt.

In this article we address three elementary problems of the MaxEnt
method: the lack of a satisfactory probabilistic rationale, the ad hoc
choice of constraints that bind MaxEnt, and the inability of the method
to take information on the amount of data into consideration.

Concerning the problem of rationale, we shall prove and demonstrate
that MaxEnt is a special, asymptotic case of a more general and
self-evident 'principle/method', MaxProb, which looks for the vector of
absolute frequencies (occurrences) in a feasible set of vectors that
has the maximal probability of being generated by a prior generator
pmf. MaxEnt is also a special case of MaxProb in the sense that it
assumes a specific, uniform prior generator pmf. In fact, the work
provides a probabilistic rationale for the more general relative
entropy maximization (REM) method. REM is also commonly known as
I-divergence minimization, or Kullback-Leibler directed distance
minimization. As a by-product of providing the MaxProb rationale for
MaxEnt, the remaining two problems, concerning the relevant constraints
and the known sample size, are also resolved.

The article is organized as follows: Section 2 reviews the current
status of work on the three above-mentioned problems. Section 3
introduces MaxProb and states the main Theorem 1 together with an
example illustrating the point. Section 4 briefly summarizes the
consequences of Theorem 1 for MaxEnt. The proof of the Theorem is given
in Appendix A. Appendix B contains a bonus.



2. Three MaxEnt problems 

We start with a brief review of three problems pertaining to MaxEnt.
MaxEnt, in the realm of statistics, lacks a proper rationale.
Moment-based probability distributions recovered through Shannon's
entropy maximization have been characterized as the smoothest, the
flattest, the most uniform given the constraints, the least prejudiced,
the one that maximizes our ignorance while including the available
statistical data. These adjectives usually serve as the main rationale
for employing Shannon's entropy criterion. The two justifications that
Jaynes developed are rarely recalled. The first one, Wallis'
multiplicity argument, is presented in Chapter 11 of (Jaynes, 1998). We
give it the form of a theorem and highlight the central role played by
Jaynes' limiting process.

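Wallis-Jaynes Theorem. Let $\mathbf{n}$ be a discrete $m$-dimensional
random variable with the multinomial distribution

$$\pi(\mathbf{n}|\mathbf{q}) = \frac{n!}{\prod_{i=1}^{m} n_i!} \prod_{i=1}^{m} q_i^{n_i}$$

where $\sum_{i=1}^{m} n_i = n$ and $\mathbf{q} = [1/m, 1/m, \dots, 1/m]'$ is uniform.
Then under Jaynes' limiting process

1) $n \to \infty$

2) $n_i \to \infty$ for $i = 1, 2, \dots, m$

3) $n_i/n \to p_i$, where $p_i$ is a constant, for $i = 1, 2, \dots, m$,

holds

$$\frac{\ln \pi(\mathbf{n}|\mathbf{q})}{n} \to H(\mathbf{p})$$

where $H(\mathbf{p}) = -\sum_{i=1}^{m} p_i \ln p_i$ is Shannon's entropy (up to an
additive constant, $-\ln m$).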

Proof. Due to Stirling's formula (there is a typo in Stirling's
approximation in Jaynes' book, see formula (11-17) there)

$$\ln n! = n \ln n - n + \ln\sqrt{2\pi n} + \frac{1}{12n} + O(1/n^2)$$

holds

$$\ln \pi(\mathbf{n}|\mathbf{q}) = n \ln n - n + \ln\sqrt{2\pi n} + \frac{1}{12n}
  - \left( \sum_{i=1}^{m} n_i \ln n_i - \sum_{i=1}^{m} n_i + \sum_{i=1}^{m} \ln\sqrt{2\pi n_i} + \sum_{i=1}^{m} \frac{1}{12 n_i} \right)
  + \sum_{i=1}^{m} n_i \ln q_i + O(1/n^2)$$

Taking into account that $\sum_{i=1}^{m} n_i = n$, the first two assumptions
of Jaynes' limiting process give

$$\frac{\ln \pi(\mathbf{n}|\mathbf{q})}{n} \to \ln n - \frac{1}{n} \sum_{i=1}^{m} n_i \ln n_i - \ln m$$

where the RHS is $-\sum_{i=1}^{m} \frac{n_i}{n} \ln \frac{n_i}{n} - \ln m$, which leads,
thanks to the third assumption of the limiting process, to the claim of
the Theorem. □
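The limit in the Wallis-Jaynes Theorem can be checked numerically. The sketch below is only an illustration: the proportions `p`, the sample sizes, and the function names are assumptions made here, not taken from the text. It evaluates $\ln \pi(\mathbf{n}|\mathbf{q})/n$ with $n_i = n p_i$ and a uniform $\mathbf{q}$, and compares it with $H(\mathbf{p}) - \ln m$.

```python
import math

def log_multinomial_pmf(counts, q):
    """ln pi(n|q) of a multinomial, via log-gamma to avoid overflow."""
    n = sum(counts)
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_coeff + sum(c * math.log(qi) for c, qi in zip(counts, q) if c > 0)

def shannon_entropy(p):
    """H(p) = -sum p_i ln p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Illustrative choice (not from the text): fixed proportions p, uniform q.
p = [0.1, 0.2, 0.3, 0.4]
m = len(p)
q = [1.0 / m] * m
limit = shannon_entropy(p) - math.log(m)

gaps = []
for n in (100, 10_000, 1_000_000):
    counts = [round(n * pi) for pi in p]  # n_i = n * p_i, so n_i/n equals p_i
    gap = log_multinomial_pmf(counts, q) / n - limit
    gaps.append(gap)
    print(n, gap)
```

The printed gap shrinks as $n$ grows, which is what the theorem asserts along this particular sequence of occurrence vectors; it says nothing about vectors that violate the second requirement of the limiting process.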
In order to provide a rationale for Shannon's entropy maximization,
one has to investigate each particular set of constraints binding the
maximization of $\pi(\mathbf{n}|\mathbf{q})$ to determine whether the constraints induce
exactly Jaynes' limiting process. This has not yet been done, and it
does not seem to be the case even for the simplest, moment consistency
constraints (mcc), which traditionally accompany entropy maximization.
For, even in the case of mcc, there is for any $n$ at least one vector
$\mathbf{n}$ in conflict with the second requirement of Jaynes' limiting process.

Thus, Wallis' argument, rather than offering explanation of Max- 
Ent, presents yet another (and interesting) partial limit of multino- 
mial distribution, in addition to the well-known DeMoivre-Laplace and 
the Poisson one. It could be better called 'Wallis-Jaynes local limit 
theorem'. 

The second rationale that Jaynes proposed has been dubbed 'The Entropy
Concentration Theorem' (see (Jaynes, 1979), (Jaynes, 1982), or
(Seidenfeld, 1987)). This rationale rests on Wallis' argument and
Jaynes' limiting process, and so again raises the question of its
relevance to constraint-bound entropy maximization.

It is well known that MaxEnt cannot incorporate information on the
amount of data (see for instance (Uffink, 1996)); in other words,
MaxEnt recovers the same pmf regardless of the sample size. This second
problem of MaxEnt is usually left open, although it directly relates to
the next, more frequently debated, third problem concerning the kind of
constraints relevant to bind the entropy maximization. According to
(Jaynes, 1979), constraints should represent testable information, i.e.
they should provide a basis for testing the probability distribution.
For instance, a set of samples is not testable, whereas the value of,
say, the third moment is. This state of the art also reveals a lack of
internal coherence in the views on entropy maximization.

The three interconnected questions are 'Just what are we accomplishing
when we maximize entropy?' (Jaynes, 1982), 'What kind of constraints
are allowed to bind the entropy maximization?' and 'What if information
on the amount of data is also available?'. The vagueness of the answers
to them has given rise, in the context of linear inverse problems, to
yet another question: 'Why MaxEnt?'. This question has been addressed
by several axiomatizations (see (Shore and Johnson, 1980),
(Tikochinsky, Tishby and Levine, 1984), (Skilling, 1988), (Csiszar,
1991), (Paris and Vencovska, 1997), (Garrett, 1999)) which single out
Shannon's entropy (or relative entropy) as the only function consistent
with the axioms. Though these axiomatizations answer the most pragmatic
'Why MaxEnt?' question, they leave unresolved the first three
('interpretational') problems.
3. REM/MaxEnt as an asymptotic case of MaxProb

Let us consider the following general setup.

Let $\mathbf{q} = [q_1, q_2, \dots, q_m]'$ be a pmf, defined on an $m$-element
support, referred to as the prior generator.

Let $\mathcal{H}_n$ be the set of all vectors $\{\mathbf{n}_1, \mathbf{n}_2, \dots, \mathbf{n}_J\}$ such that
the adding-up constraint $\sum_{i=1}^{m} n_{ij} = n$, for $j = 1, 2, \dots, J$, is
satisfied. $\mathbf{n}$ will be referred to as an occurrence vector, $\mathcal{H}_n$ as
the occurrence-vector working set.

Let $\mathcal{P}$ be the set of all probability vectors, such that $\sum_{i=1}^{m} p_i = 1$.

Then, a simple question can be asked.

Question 1. What is the most probable occurrence vector $\hat{\mathbf{n}}$, among
the occurrence vectors $\mathbf{n}$ from the working set $\mathcal{H}_n$, to be generated
by the prior generator $\mathbf{q}$?

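The answer to the question is

$$\hat{\mathbf{n}} = \arg\max_{\mathbf{n} \in \mathcal{H}_n} \pi(\mathbf{n}|\mathbf{q}) \tag{3.1}$$

where

$$\pi(\mathbf{n}|\mathbf{q}) = \frac{n!}{\prod_{i=1}^{m} n_i!} \prod_{i=1}^{m} q_i^{n_i} \tag{3.2}$$

is the probability of generating the occurrence vector $\mathbf{n}$ by the
prior generator $\mathbf{q}$.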

Definition 1. The above setup and Question 1, leading to task (3.1), 
(3.2) will be referred to as MaxProb. 

The following theorem states the main result on asymptotic equiv- 
alence of MaxProb and REM/MaxEnt. The proof is developed in the 
Appendix A. 

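Theorem 1. Let $\mathbf{q}$ be the prior generator and $\mathcal{H}_n$ be the working set.
Let $\hat{\mathbf{n}}$ be the most probable occurrence vector from the working set
$\mathcal{H}_n$ to be generated by the prior generator $\mathbf{q}$, and let $n \to \infty$.
Then

$$\frac{\hat{\mathbf{n}}}{n} = \hat{\mathbf{p}}$$

where

$$\hat{\mathbf{p}} = \arg\max_{\mathbf{p} \in \mathcal{P}} H(\mathbf{p}, \mathbf{q})$$

and

$$H(\mathbf{p}, \mathbf{q}) = -\sum_{i=1}^{m} p_i \ln \frac{p_i}{q_i}$$

is the relative entropy of the probability vector $\mathbf{p}$ on the generator $\mathbf{q}$.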
Corollary. If also some other differentiable constraint $F(\mathbf{n}/n) = 0$ is
employed to form the working set $\mathcal{H}_n$, and a corresponding constraint
$F(\mathbf{p}) = 0$ is added to the relative entropy maximization, the claim of
Theorem 1 remains valid.

Note. If the prior generator is uniform, $H(\mathbf{p}, \mathbf{q})$ reduces, up to the
additive constant $-\ln m$, to Shannon's entropy
$H(\mathbf{p}) = -\sum_{i=1}^{m} p_i \ln p_i$.

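Example. Let $\mathbf{q} = [0.13\ 0.09\ 0.42\ 0.36]'$. Let $\mathcal{H}_n$ consist of all
occurrence vectors $\{\mathbf{n}_1, \mathbf{n}_2, \dots, \mathbf{n}_J\}$ such that

$$\sum_{i=1}^{m} n_{ij} = n \quad \text{for } j = 1, 2, \dots, J$$
$$\sum_{i=1}^{m} n_{ij} x_i = 3.2\,n \quad \text{for } j = 1, 2, \dots, J \tag{3.3}$$

where $\mathbf{x} = [1\ 2\ 3\ 4]'$.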

Table I shows $\hat{\mathbf{n}}/n$ and $J$, for $n = 10, 50, 100, 500, 1000$, together
with the probability vector $\hat{\mathbf{p}}$ maximizing the relative entropy
$H(\mathbf{p}, \mathbf{q})$ under the constraints

$$\sum_{i=1}^{m} p_i = 1$$
$$\sum_{i=1}^{m} p_i x_i = 3.2 \tag{3.4}$$

Results for the uniform prior generator are in the fourth column of the
table.
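The Example can be reproduced for small $n$ by brute force. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, the working set is enumerated exhaustively (feasible only for small $n$), and the REM step assumes the standard Lagrangian form of the maximizer under (3.4), $p_i \propto q_i e^{\lambda x_i}$, with $\lambda$ found by bisection.

```python
import math
from itertools import product

X = (1, 2, 3, 4)  # support points x from the Example

def working_set(n):
    """All occurrence vectors with sum n and sum_i n_i x_i = 3.2 n, as in (3.3)."""
    vectors = []
    for n1, n2, n3 in product(range(n + 1), repeat=3):
        n4 = n - n1 - n2 - n3
        if n4 < 0:
            continue
        counts = (n1, n2, n3, n4)
        # 3.2 = 16/5: compare in integers to avoid floating-point equality tests
        if 5 * sum(c * x for c, x in zip(counts, X)) == 16 * n:
            vectors.append(counts)
    return vectors

def log_prob(counts, q):
    """ln pi(n|q) from (3.2), computed with log-gamma."""
    n = sum(counts)
    return (math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
            + sum(c * math.log(qi) for c, qi in zip(counts, q) if c > 0))

def max_prob(n, q):
    """MaxProb task (3.1): most probable vector in the working set, and J."""
    ws = working_set(n)
    return max(ws, key=lambda c: log_prob(c, q)), len(ws)

def rem(q, target=3.2, lo=-50.0, hi=50.0, iters=200):
    """REM under (3.4), assuming p_i proportional to q_i * exp(lam * x_i)."""
    def tilted(lam):
        w = [qi * math.exp(lam * x) for qi, x in zip(q, X)]
        s = sum(w)
        return [wi / s for wi in w]
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mean = sum(pi * x for pi, x in zip(tilted(mid), X))
        if mean < target:  # the mean is increasing in lam
            lo = mid
        else:
            hi = mid
    return tilted(0.5 * (lo + hi))

q = [0.13, 0.09, 0.42, 0.36]
n_hat, J = max_prob(10, q)
print(J, [c / 10 for c in n_hat])
print([round(pi, 4) for pi in rem(q)])
```

For $n = 10$ the enumeration visits only ten feasible vectors, so the gap between $\hat{\mathbf{n}}/n$ and $\hat{\mathbf{p}}$ reported in Table I is easy to inspect by hand; for larger $n$ the cubic enumeration becomes slow and a smarter search would be needed.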



4. Concluding remarks 

As a way of summing up we make the following points: 

1) The multiplicity argument, as it was left to us by Boltzmann,
Wallis and Jaynes, does not provide a satisfying rationale for MaxEnt.
However, it does provide a clue where to search for one.

2) Theorem 1, as any asymptotic theorem, can be interpreted in two
directions: either moving towards infinity or moving back to
finiteness. The first direction provides the MaxProb rationale for
REM/MaxEnt. The second direction shows that the proper place of
REM/MaxEnt is only in the asymptotics.
Table I. MaxProb and REM/MaxEnt

   n        J    $\hat{\mathbf{n}}/n$                        $\hat{\mathbf{n}}/n$, uniform prior

   10      10    0.1000 0.0000 0.5000 0.4000    0.1000 0.1000 0.3000 0.5000
   50     154    0.0800 0.0600 0.4400 0.4200    0.0800 0.1400 0.2800 0.5000
  100     574    0.0800 0.0700 0.4200 0.4300    0.0800 0.1400 0.2800 0.5000
  500   13534    0.0820 0.0700 0.4140 0.4340    0.0780 0.1460 0.2740 0.5020
 1000   53734    0.0830 0.0700 0.4110 0.4360    0.0790 0.1460 0.2710 0.5040

 $\hat{\mathbf{p}}$              0.0826 0.0709 0.4103 0.4361    0.0788 0.1462 0.2714 0.5037

3) MaxProb also automatically solves the known-sample-size prob- 
lem. 

4) The proof of Theorem 1 implies that constraints that bind
REM/MaxEnt should be differentiable.

5) Hidden behind (3.1), (3.2) is an assumption of iid sampling. Thus
REM/MaxEnt, as an asymptotic form of MaxProb, seems to be limited to
the iid case.

The probabilistic rationale for REM/MaxEnt/I-divergence minimization
may also be extended to provide a rationale for J-divergence
minimization, see (Grendar and Grendar, 2000).

Area of applicability of MaxProb/REM/MaxEnt should be obvious 
- wherever Question 1 is reasonable to ask. 

Philosophical consequences of Theorem 1 are left to the reader. 
Appendix

A. Proof of Theorem 1

Proof. 1)

$$\max_{\mathbf{n}} \pi(\mathbf{n}|\mathbf{q})$$

subject to

$$\sum_{i=1}^{m} n_i = n$$

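For the purpose of maximization, $\pi(\mathbf{n}|\mathbf{q})$ can be log-transformed
into

$$v(\mathbf{n}) = \gamma\Big(1 + \sum_{i=1}^{m} n_i\Big) - \sum_{i=1}^{m} \gamma(n_i + 1) + l$$

where $\gamma(\cdot) = \ln \Gamma(\cdot)$, $\Gamma(\cdot)$ is the gamma function, and
$l = \sum_{i=1}^{m} n_i \ln q_i$. The necessary condition for the maximum of
$\pi(\mathbf{n}|\mathbf{q})$ then is

$$dv(\mathbf{n}) = \sum_{i=1}^{m} \left[ -\gamma'(n_i + 1) + \ln q_i \right] dn_i = 0 \tag{A.1}$$

since, according to the assumed adding-up constraint, $\sum_{i=1}^{m} n_i = n$.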
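First, it will be proved that

$$\lim_{n_i \to \infty} \left[ \gamma'(n_i) - \ln(n_i + k) \right] = 0 \tag{A.2}$$

for any $n_i$, and any $k$.

The first derivative of $\gamma$ can be written in the form of an infinite
series (see (Fichtengolz, 1969))

$$\gamma'(n_i) = g(n_i) - C$$

where

$$g(n_i) = \sum_{j=0}^{\infty} \left( \frac{1}{j + 1} - \frac{1}{j + n_i} \right)$$

and $C$ is Euler's constant.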
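For $n_i \in \mathbb{Z}$, denoted $\bar{n}_i$, the series $g(\bar{n}_i)$ reduces into a
harmonic series $H$

$$H(\bar{n}_i) = \sum_{j=1}^{\bar{n}_i - 1} \frac{1}{j}$$

Let $n_i = \bar{n}_i + h$, where $0 < h < 1$. Then

$$\frac{1}{j+1} - \frac{1}{j + \bar{n}_i} < \frac{1}{j+1} - \frac{1}{j + \bar{n}_i + h} < \frac{1}{j+1} - \frac{1}{j + \bar{n}_i + 1},$$

so

$$g(\bar{n}_i) < g(n_i) < g(\bar{n}_i + 1).$$

Since the difference of the bounding series converges to zero,

$$\lim_{\bar{n}_i \to \infty} \left[ g(\bar{n}_i + 1) - g(\bar{n}_i) \right] = \lim_{\bar{n}_i \to \infty} \frac{1}{\bar{n}_i} = 0$$

also

$$\lim_{n_i \to \infty} \left[ g(n_i) - g(\bar{n}_i) \right] = 0. \tag{A.3}$$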

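Due to a known property of harmonic series

$$\lim_{\bar{n}_i \to \infty} \left[ H(\bar{n}_i) - \ln(\bar{n}_i + k) - C \right] = 0, \quad \text{for any } k$$

holds also

$$\lim_{n_i \to \infty} \left[ g(\bar{n}_i) - \ln(n_i + k) - C \right] = 0 \tag{A.4}$$

thus, adding (A.3) and (A.4) up gives

$$\lim_{n_i \to \infty} \left[ g(n_i) - \ln(n_i + k) - C \right] = 0,$$

respectively, for the derivative (recall that $\gamma'(n_i) = g(n_i) - C$)

$$\lim_{n_i \to \infty} \left[ \gamma'(n_i) - \ln(n_i + k) \right] = 0,$$

which is just (A.2).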
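The limit (A.2) can be illustrated numerically at integer arguments, where $\gamma'(\bar{n}_i) = H(\bar{n}_i) - C$ with $H(\bar{n}_i) = \sum_{j=1}^{\bar{n}_i - 1} 1/j$. The sketch below is only a sanity check under that restriction; the function names and the chosen values of $n$ and $k$ are assumptions made for illustration.

```python
import math

EULER_C = 0.5772156649015329  # Euler's constant C

def gamma_prime(n):
    """Derivative of ln Gamma at a positive integer n: H(n) - C."""
    return sum(1.0 / j for j in range(1, n)) - EULER_C

def gap(n, k):
    """gamma'(n) - ln(n + k), the bracket in (A.2)."""
    return gamma_prime(n) - math.log(n + k)

for n in (10, 1_000, 100_000):
    print(n, gap(n, 1), gap(n, 5))
```

For each fixed $k$ the printed gap decays roughly like $1/n$, which is consistent with (A.2) but of course does not replace the proof for non-integer $n_i$.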

Without loss of generality, for $n \to \infty$ we can restrict ourselves to
the sub-space of $n_i \to \infty$, for $i = 1, 2, \dots, m$, if the maximum
exists on the sub-space, so the conditions of maximum for $n_i \to \infty$
take, thanks to (A.2), the form

$$\sum_{i=1}^{m} \left[ -\ln n_i + \ln q_i \right] dn_i = 0 \tag{A.5a}$$

$$\sum_{i=1}^{m} n_i = n \tag{A.5b}$$



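Due to Lemma 1, with $k = \frac{1}{n}$, system (A.5) can be transformed into
an equivalent one

$$\sum_{i=1}^{m} \left[ -\ln \frac{n_i}{n} + \ln q_i \right] d\frac{n_i}{n} = 0 \tag{A.6a}$$

$$\sum_{i=1}^{m} \frac{n_i}{n} = 1 \tag{A.6b}$$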


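2)

$$\max_{\mathbf{p}} H(\mathbf{p}, \mathbf{q})$$

subject to

$$\sum_{i=1}^{m} p_i = 1$$

Thus the necessary conditions for the maximum of the relative entropy
$H(\mathbf{p}, \mathbf{q})$ constrained by the respective adding-up constraint are

$$dH(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{m} \left[ -\ln p_i + \ln q_i \right] dp_i = 0 \tag{A.7a}$$

$$\sum_{i=1}^{m} p_i = 1 \tag{A.7b}$$

Comparing (A.7) and (A.6) completes the proof. □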



Note. Claim of the Corollary of Theorem 1 is immediately implied by 
the proof. 

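Lemma 1.

$$dv(k\mathbf{n}) = k\, dv(\mathbf{n})$$

Proof. $dv(k\mathbf{n}) = \sum_{i=1}^{m} [-\ln k n_i + \ln q_i]\, d(k n_i) = k \sum_{i=1}^{m} [-\ln k - \ln n_i + \ln q_i]\, dn_i = k \sum_{i=1}^{m} [-\ln n_i + \ln q_i]\, dn_i = k\, dv(\mathbf{n})$,
where the term with $\ln k$ vanishes because $\sum_{i=1}^{m} dn_i = 0$ under
the adding-up constraint. □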
B. Expected Occurrence Vector (ExpOc) and REM/MaxEnt

A question different from Question 1 can also be asked.

Question 2. What is the expected occurrence vector $\bar{\mathbf{n}}$ of the
occurrence vectors $\mathbf{n}$ from the working set $\mathcal{H}_n$ which can be generated
by the prior generator $\mathbf{q}$?

Specifying the notion of expectation,

$$\bar{\mathbf{n}} = \frac{\sum_{j=1}^{J} \pi(\mathbf{n}_j|\mathbf{q})\, \mathbf{n}_j}{\sum_{j=1}^{J} \pi(\mathbf{n}_j|\mathbf{q})} \tag{B.1}$$



Theorem 2. Let $\mathbf{q}$ be the prior generator and $\mathcal{H}_n$ be the working set.
Let $\bar{\mathbf{n}}$ be the expected occurrence vector of the working set $\mathcal{H}_n$,
to be generated by the prior generator $\mathbf{q}$, and let $n \to \infty$. Then,
using the notation of Theorem 1,

$$\frac{\bar{\mathbf{n}}}{n} = \hat{\mathbf{p}}$$

Corollary. If also a moment consistency constraint is employed to
form the working set $\mathcal{H}_n$, and the corresponding constraint is added to
the relative entropy maximization, the claim of Theorem 2 remains valid.

Note. The generality of Corollary 2 is under both numerical and
theoretical investigation. Theorem 2 and its Corollary, although so far
supported only by numerical calculations, indicate that MaxEnt/REM can
as well be an asymptotic form of another method/principle, ExpOc, which
looks for the expected occurrence vector in the feasible set of vectors
generated by a prior generator pmf.

The following example illustrates the point of Theorem 2/Corollary 2.

Example. Assuming the same setup as in the previous Example, the
expected occurrence vectors for $n = [10, 50, 100, 500, 1000]$ and
generators $\mathbf{q} = [0.13\ 0.09\ 0.42\ 0.36]'$, $\mathbf{q} = [0.25\ 0.25\ 0.25\ 0.25]'$ are
in Table II. Note that $\bar{\mathbf{n}}/n$ converges to $\hat{\mathbf{p}}$ faster than $\hat{\mathbf{n}}/n$.
Table II. ExpOc and REM/MaxEnt

   n     $\bar{\mathbf{n}}/n$                        $\bar{\mathbf{n}}/n$, uniform prior

   10    0.0721 0.0736 0.4365 0.4178    0.0701 0.1510 0.2877 0.4912
   50    0.0806 0.0714 0.4153 0.4327    0.0771 0.1471 0.2745 0.5013
  100    0.0816 0.0712 0.4128 0.4344    0.0779 0.1466 0.2729 0.5025
  500    0.0824 0.0710 0.4108 0.4358    0.0786 0.1463 0.2717 0.5035
 1000    0.0825 0.0709 0.4106 0.4360    0.0787 0.1462 0.2715 0.5036

 $\hat{\mathbf{p}}$      0.0826 0.0709 0.4103 0.4361    0.0788 0.1462 0.2714 0.5037
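The expectation (B.1) can be evaluated directly for small $n$. The sketch below is an illustration under stated assumptions: it uses the working set of the Example (adding-up and moment constraint with $\mathbf{x} = [1\ 2\ 3\ 4]'$), enumerates it exhaustively, and uses hypothetical function names. Weights are normalized in log space before exponentiation, which does not change (B.1) but keeps the computation stable for larger $n$.

```python
import math
from itertools import product

X = (1, 2, 3, 4)  # support points x from the Example

def working_set(n):
    """All occurrence vectors with sum n and sum_i n_i x_i = 3.2 n."""
    vectors = []
    for n1, n2, n3 in product(range(n + 1), repeat=3):
        n4 = n - n1 - n2 - n3
        if n4 < 0:
            continue
        counts = (n1, n2, n3, n4)
        # 3.2 = 16/5: integer comparison avoids floating-point equality tests
        if 5 * sum(c * x for c, x in zip(counts, X)) == 16 * n:
            vectors.append(counts)
    return vectors

def log_prob(counts, q):
    """ln pi(n|q), computed with log-gamma."""
    n = sum(counts)
    return (math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
            + sum(c * math.log(qi) for c, qi in zip(counts, q) if c > 0))

def expected_occurrence(n, q):
    """Expected occurrence vector (B.1) over the working set."""
    ws = working_set(n)
    logs = [log_prob(c, q) for c in ws]
    top = max(logs)
    weights = [math.exp(lp - top) for lp in logs]  # common factor cancels in (B.1)
    total = sum(weights)
    return [sum(w * c[i] for w, c in zip(weights, ws)) / total
            for i in range(len(q))]

q = [0.13, 0.09, 0.42, 0.36]
n_bar = expected_occurrence(10, q)
print([round(v / 10, 4) for v in n_bar])
```

Running the same function with the uniform generator gives the first row of the uniform-prior column, so the two columns of Table II differ only in the `q` passed in.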